This is a learning note for this series of videos.
Paper Link: https://arxiv.org/abs/2010.02502
1. Background

Background:
DDPM pro: high-quality image generation.
DDPM con: requires many steps for sampling.

Goal:
Accelerate the sampling process.

How:
Generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones. The resulting training objective is the same as the DDPM objective, so we can take an existing pretrained generative model and only modify the sampling process.
DDIM can massively increase sampling efficiency at only a minor cost in sample quality.

Empirical benefits of DDIMs over DDPMs:
- Sampling can be accelerated by 10x to 100x.
- Sample consistency: when the initial latent variable is fixed and several samples are generated with Markov chains of various lengths, the resulting samples share similar high-level features.
- Because of this consistency, DDIMs can perform semantically meaningful image interpolation by manipulating the initial latent variable.
2. Denoising Objective of DDPM

Suppose the true data distribution is $q(\bold{x}_0)$; we want to learn an approximate distribution $p_{\theta}(\bold{x}_0)$ that is close to $q(\bold{x}_0)$.

DDPMs are latent variable models of the form

$$
p_{\theta}(\bold{x}_0) = \int p_{\theta} (\bold{x}_{0:T}) d \bold{x}_{1:T}, \quad \text{where} \quad p_{\theta} (\bold{x}_{0:T}) \coloneqq p_{\theta} (\bold{x}_T) \prod_{t=1}^T p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)
$$

where $\bold{x}_1, \dots, \bold{x}_T$ are latent variables. The parameters $\theta$ are learned to fit the data distribution $q(\bold{x}_0)$ by maximizing a variational lower bound:

$$
\max_{\theta} \mathbb{E}_{q(\bold{x}_0)}[\log p_{\theta}(\bold{x}_0)] \geq \max_{\theta} \mathbb{E}_{q(\bold{x}_0, \bold{x}_1, \dots, \bold{x}_T)} [\log p_{\theta}(\bold{x}_{0:T}) - \log q(\bold{x}_{1:T} | \bold{x}_0)]
$$

where $q(\bold{x}_{1:T}|\bold{x}_0)$ is some inference distribution (the forward process). DDPM takes it to be the following Markov chain with Gaussian transitions, parameterized by a decreasing sequence $\alpha_{1:T} \in (0, 1]^T$:

$$
q(\bold{x}_{1:T}|\bold{x}_0) \coloneqq \prod_{t=1}^T q(\bold{x}_t | \bold{x}_{t-1}), \quad \text{where } q(\bold{x}_t | \bold{x}_{t-1}) \coloneqq \mathcal{N} \left( \sqrt{\frac{\alpha_t}{\alpha_{t-1}}} \bold{x}_{t-1}, \left(1 - \frac{\alpha_t}{\alpha_{t-1}}\right) \bold{I} \right)
$$

A special property of the forward process is that

$$
q(\bold{x}_t | \bold{x}_0) \coloneqq \int q(\bold{x}_{1:t} | \bold{x}_0) d \bold{x}_{1:(t-1)} = \mathcal{N} (\bold{x}_t; \sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})
$$

Note: we need $\alpha_T \rightarrow 0$, so that the data becomes (approximately) standard Gaussian after adding noise.

By the reparameterization trick, $\bold{x}_t$ can be expressed as a linear combination of $\bold{x}_0$ and a noise variable $\epsilon$:

$$
\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon, \qquad \text{where} \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})
$$

If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the variational lower bound objective can be simplified to:

$$
L_{\gamma} (\epsilon_{\theta}) \coloneqq \sum_{t=1}^T \gamma_t \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})} \left[||\epsilon_{\theta}^{(t)} (\sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t) - \epsilon_t||_2^2\right]
$$

where $\epsilon_{\theta} \coloneqq \{ \epsilon_{\theta}^{(t)}\}_{t=1}^T$ is a set of $T$ functions, each $\epsilon_{\theta}^{(t)}: \mathcal{X} \rightarrow \mathcal{X}$ is a function with trainable parameters $\theta^{(t)}$, and $\gamma \coloneqq [\gamma_1, \dots, \gamma_T]$ is a vector of positive coefficients in the objective that depends on $\alpha_{1:T}$.
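To make the objective concrete, here is a minimal NumPy sketch of a single Monte Carlo term of $L_{\gamma}$ with $\gamma_t = 1$. This is my own illustration, not code from the paper: `eps_model` is a hypothetical noise-prediction network and `alphas` is a hypothetical decreasing schedule.

```python
import numpy as np

def ddpm_loss_term(eps_model, x0, alphas, rng):
    """One Monte Carlo term of L_gamma with gamma_t = 1 (noise-prediction loss).

    alphas[t-1] plays the role of alpha_t in the DDIM notation
    (i.e. the cumulative product alpha_bar_t in DDPM notation).
    """
    T = len(alphas)
    t = rng.integers(1, T + 1)                          # sample a timestep uniformly
    a_t = alphas[t - 1]
    eps = rng.standard_normal(x0.shape)                 # eps_t ~ N(0, I)
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps    # "overall step" of the forward process
    return np.sum((eps_model(x_t, t) - eps) ** 2)       # ||eps_theta(x_t) - eps_t||_2^2

# usage with a dummy "model" that always predicts zero noise
rng = np.random.default_rng(0)
alphas = np.linspace(0.9999, 1e-4, 1000)                # decreasing alpha_1:T (made up)
x0 = rng.standard_normal(32)                            # a toy data point
print(ddpm_loss_term(lambda x, t: np.zeros_like(x), x0, alphas, rng))
```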
3. Why DDIM Uses a Non-Markovian Forward Process

Note that in the DDPM loss, the term $||\epsilon_{\theta}^{(t)} (\sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t) - \epsilon_t||_2^2$ only relies on the "overall step" distribution $q(\bold{x}_t | \bold{x}_0)$, not directly on $q(\bold{x}_{1:T} | \bold{x}_0)$.

There are many choices of inference distribution $q(\bold{x}_{1:T} | \bold{x}_0)$ that share the same marginals $q(\bold{x}_t | \bold{x}_0)$.

If we keep the overall step $q(\bold{x}_t | \bold{x}_0)$ unchanged, then the objective $L_{\gamma}$ does not need to change, so the model does not need to be retrained.
Non-Markovian Forward Processes for a Discrete Case

The non-Markovian forward process construction also works beyond the Gaussian case (Appendix A).

Consider a categorical observation $\bold{x}_0$ that is a one-hot vector with $K$ possible values. The forward process is defined as follows. First, we define $q(\bold{x}_t | \bold{x}_0)$ as the categorical distribution

$$
q(\bold{x}_t | \bold{x}_0) = \text{Cat}(\alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K)
$$

where $\bold{1}_K \in \mathbb{R}^K$ is a vector with all entries equal to $1/K$, and $\alpha_t$ decreases from $\alpha_0 = 1$ to $\alpha_T = 0$. Then we define $q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0)$ as the following mixture distribution:

$$
q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0) = \begin{cases} \text{Cat}(\bold{x}_t) & \text{with probability } \sigma_t \\ \text{Cat}(\bold{x}_0) & \text{with probability } \alpha_{t-1} - \sigma_t \alpha_t \\ \text{Cat}(\bold{1}_K) & \text{with probability } (1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t\end{cases}
$$

or equivalently:

$$
q(\bold{x}_{t-1} | \bold{x}_t , \bold{x}_0) = \text{Cat}\big(\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K\big)
$$

This definition is consistent with how we defined $q(\bold{x}_t | \bold{x}_0)$.
Why consistent?

Plug $\bold{x}_t = \alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K$ into $\sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K$ (which marginalizes $\bold{x}_t$ out), and we get:

$$
\begin{align*} & \sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K \\ =& \sigma_t [\alpha_t \bold{x}_0 + (1 - \alpha_t) \bold{1}_K] + (\alpha_{t-1} - \sigma_t \alpha_t ) \bold{x}_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t) \bold{1}_K \\ =& \sigma_t \alpha_t \bold{x}_0 + \sigma_t(1 - \alpha_t) \bold{1}_K + \alpha_{t-1}\bold{x}_0 - \sigma_t \alpha_t \bold{x}_0 + (1 - \alpha_{t-1})\bold{1}_K - (1 - \alpha_t) \sigma_t \bold{1}_K \\ =& \alpha_{t-1} \bold{x}_0 + (1 - \alpha_{t-1}) \bold{1}_K\end{align*}
$$

which is exactly the parameter of $q(\bold{x}_{t-1} | \bold{x}_0)$.
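As a quick numerical sanity check of this algebra (my own sketch, with arbitrary made-up values for $K$, $\alpha_t$, $\alpha_{t-1}$, $\sigma_t$), marginalizing $\bold{x}_t$ out of the mixture parameter indeed recovers the parameter of $q(\bold{x}_{t-1} | \bold{x}_0)$:

```python
import numpy as np

K = 5
x0 = np.eye(K)[2]                      # one-hot observation x_0
ones_K = np.full(K, 1.0 / K)           # the vector 1_K (all entries 1/K)
a_t, a_tm1, sigma_t = 0.4, 0.7, 0.3    # made-up alpha_t, alpha_{t-1}, sigma_t

p_t = a_t * x0 + (1 - a_t) * ones_K    # parameter of q(x_t | x_0), i.e. E[x_t | x_0]
mixed = (sigma_t * p_t
         + (a_tm1 - sigma_t * a_t) * x0
         + ((1 - a_tm1) - (1 - a_t) * sigma_t) * ones_K)   # x_t marginalized out
target = a_tm1 * x0 + (1 - a_tm1) * ones_K                 # parameter of q(x_{t-1} | x_0)

print(np.allclose(mixed, target))      # True
```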
Similarly, we can define the reverse process $p_{\theta}(\bold{x}_{t-1} | \bold{x}_t)$ as:

$$
p_{\theta}(\bold{x}_{t-1} | \bold{x}_t) = \text{Cat} \left( \sigma_t \bold{x}_t + (\alpha_{t-1} - \sigma_t \alpha_t) f_{\theta}^{(t)}(\bold{x}_t) + [(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t ] \bold{1}_K \right)
$$

where $f_{\theta}^{(t)}(\bold{x}_t)$ maps $\bold{x}_t$ to a $K$-dimensional vector. As $(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t \rightarrow 0$, the sampling process becomes less stochastic.
The objective is the KL divergence

$$
D_{\text{KL}}(q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta} (\bold{x}_{t-1} | \bold{x}_t))
$$

which is simply the KL divergence between two categorical distributions. Since both distributions are mixtures with the same weights and the KL divergence is jointly convex in its arguments, we can apply Jensen's inequality to obtain an upper bound on the objective:

$$
\begin{align*}& D_{\text{KL}}(q(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta} (\bold{x}_{t-1} | \bold{x}_t)) \\ \leq & \ \sigma_t D_{\text{KL}}(\text{Cat}(\bold{x}_t) \,||\, \text{Cat}(\bold{x}_t)) + (\alpha_{t-1} - \sigma_t \alpha_t) D_{\text{KL}}( \text{Cat}(\bold{x}_0)\,||\, \text{Cat}( f_{\theta}^{(t)}(\bold{x}_t))) \\ & + [(1 - \alpha_{t-1}) - (1 - \alpha_t) \sigma_t] D_{\text{KL}}(\text{Cat}(\bold{1}_K) \,||\, \text{Cat}(\bold{1}_K)) \\ =& \ 0 + (\alpha_{t-1} - \sigma_t \alpha_t) D_{\text{KL}}( \text{Cat}(\bold{x}_0)\,||\, \text{Cat}( f_{\theta}^{(t)}(\bold{x}_t))) + 0 \\ =& \ (\alpha_{t-1} - \sigma_t \alpha_t) D_{\text{KL}}( \text{Cat}(\bold{x}_0)\,||\, \text{Cat}( f_{\theta}^{(t)}(\bold{x}_t)))\end{align*}
$$

This also shows that the choice of $\sigma_t$ does not affect the objective (up to re-weighting).
4. The Core Sampling Method of DDIM (and why)

Non-Markovian Forward Processes

Consider a family $\mathcal{Q}$ of inference distributions, indexed by a real vector $\sigma \in \mathbb{R}^T_{\geq 0}$:

$$
q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)
$$

where $q_{\sigma}(\bold{x}_T | \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_T} \bold{x}_0 , (1 - \alpha_T) \bold{I})$ and, for all $t > 1$,

$$
q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} \left(\sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I}\right)
$$

The mean function is chosen to ensure that $q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N} (\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$ for all $t$, so that the joint inference distribution matches the desired marginals (it matches the original DDPM's overall step, so the model does not need to be retrained).
How is the sampling process defined?

(1) DDPM sampling: $\bold{x}_{t} \rightarrow \bold{x}_{t-1} \rightarrow \dots \rightarrow \bold{x}_{0}$

$$
p(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) = \frac{p(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0) \cdot p(\bold{x}_{t-1} | \bold{x}_0)}{p(\bold{x}_t | \bold{x}_0)}
$$

The overall steps $p(\bold{x}_t | \bold{x}_0)$ and $p(\bold{x}_{t-1} | \bold{x}_0)$ are fixed, and $p(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0)$ can be simplified by the Markov property to $p(\bold{x}_t | \bold{x}_{t-1})$.
(2) Skip steps (non-Markovian)

Consider $s < k$. The sampling distribution can be written as:

$$
p(\bold{x}_{s} | \bold{x}_k, \bold{x}_0) = \frac{p(\bold{x}_k | \bold{x}_{s}, \bold{x}_0) \cdot p(\bold{x}_{s} | \bold{x}_0)}{p(\bold{x}_k | \bold{x}_0)}
$$

The overall steps are still known and do not change:

$$
p(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} \bold{x}_0, (1 - \bar{\alpha}_t) \bold{I}) \qquad \text{(in DDPM notation)}
$$

Now the problem is how to choose the sampling distribution (by the method of undetermined coefficients).
We still take $p(\bold{x}_s | \bold{x}_k, \bold{x}_0)$ to be a Gaussian, and assume that its mean is a linear combination of $\bold{x}_k$ and $\bold{x}_0$:

$$
p(\bold{x}_s | \bold{x}_k, \bold{x}_0) = \mathcal{N} (n \bold{x}_0 + m\bold{x}_k, \ \sigma^2 \bold{I})
$$

where $n$, $m$, $\sigma^2$ are unknowns to be determined.
Reparameterization gives:

$$
\bold{x}_s = (n \bold{x}_0 + m \bold{x}_k) + \sigma \epsilon', \quad \text{where } \epsilon' \sim \mathcal{N}(\bold{0}, \bold{I})
$$

Reparameterization of the overall step gives:

$$
\bold{x}_t = \sqrt{\bar{\alpha}_t} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon'', \quad \text{where } \epsilon'' \sim \mathcal{N}(\bold{0}, \bold{I})
$$

Therefore we can derive:

$$
\begin{align*}\bold{x}_s &= (n \bold{x}_0 + m\bold{x}_k) + \sigma \epsilon' \\ &= (n\bold{x}_0 + m(\sqrt{\bar{\alpha}_k} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_k} \epsilon'')) + \sigma \epsilon' \\ &= (n + m \sqrt{\bar{\alpha}_k}) \bold{x}_0 + [m\sqrt{1 - \bar{\alpha}_k} \epsilon'' + \sigma \epsilon'] \\ &= (n + m \sqrt{\bar{\alpha}_k}) \bold{x}_0 + \sqrt{m^2 (1 - \bar{\alpha}_k) + \sigma^2}\, \epsilon , \quad \text{where } \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})\end{align*}
$$

The overall step also gives $\bold{x}_s$ directly:

$$
\bold{x}_s = \sqrt{\bar{\alpha}_s} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_s} \epsilon'', \quad \text{where } \epsilon'' \sim \mathcal{N}(\bold{0}, \bold{I})
$$

Using the method of undetermined coefficients, we have:
$$
\begin{cases} n + m\sqrt{\bar{\alpha}_k} = \sqrt{\bar{\alpha}_s} \\ m^2(1 - \bar{\alpha}_k) + \sigma^2 = 1 - \bar{\alpha}_s\end{cases}
$$

Here we choose $\sigma$ as the free variable and solve for $n$ and $m$:

$$
\begin{cases}m = \frac{\sqrt{1 - \bar{\alpha}_s - \sigma^2}}{\sqrt{1 - \bar{\alpha}_k}} \\ n = \sqrt{\bar{\alpha}_s} - \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\sqrt{\bar{\alpha}_k}}{\sqrt{1 - \bar{\alpha}_k}} \end{cases}
$$

So we have:

$$
\begin{align*}p(\bold{x}_s | \bold{x}_k, \bold{x}_0) &= \mathcal{N} \left(\left(\sqrt{\bar{\alpha}_s} - \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\sqrt{\bar{\alpha}_k}}{\sqrt{1 - \bar{\alpha}_k}}\right) \bold{x}_0 + \frac{\sqrt{1 - \bar{\alpha}_s - \sigma^2}}{\sqrt{1 - \bar{\alpha}_k}}\bold{x}_k, \ \sigma^2 \bold{I}\right) \\ &= \mathcal{N}\left(\sqrt{\bar{\alpha}_s} \bold{x}_0 + \sqrt{1 - \bar{\alpha}_s - \sigma^2} \cdot \frac{\bold{x}_k - \sqrt{\bar{\alpha}_k} \bold{x}_0}{\sqrt{1 - \bar{\alpha}_k}}, \ \sigma^2 \bold{I}\right)\end{align*}
$$

Changing from DDPM notation to DDIM notation ($\bar{\alpha} \rightarrow \alpha$):

$$
p(\bold{x}_s | \bold{x}_k, \bold{x}_0) = \mathcal{N}\left(\sqrt{\alpha_s} \bold{x}_0 + \sqrt{1 - \alpha_s - \sigma^2} \cdot \frac{\bold{x}_k - \sqrt{\alpha_k} \bold{x}_0}{\sqrt{1 - \alpha_k}}, \ \sigma^2 \bold{I}\right)
$$

When $\sigma = 0$, this is DDIM; when $\sigma = \sigma_{\text{DDPM}}$, this is DDPM. One can also interpolate in between.
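A small Monte Carlo check of this result (my own sketch; `abar_k`, `abar_s`, `sigma`, and the scalar `x0` are arbitrary made-up values): sampling $\bold{x}_k \sim q(\bold{x}_k | \bold{x}_0)$ and then $\bold{x}_s \sim p(\bold{x}_s | \bold{x}_k, \bold{x}_0)$ from the distribution above should reproduce the marginal $\mathcal{N}(\sqrt{\bar{\alpha}_s}\bold{x}_0, (1 - \bar{\alpha}_s)\bold{I})$ for any choice of $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
abar_k, abar_s, sigma = 0.5, 0.8, 0.1      # made-up alpha_bar_k, alpha_bar_s (s < k), sigma
x0 = 1.7                                    # a scalar "data point" for simplicity

eps = rng.standard_normal(1_000_000)
x_k = np.sqrt(abar_k) * x0 + np.sqrt(1 - abar_k) * eps        # x_k ~ q(x_k | x_0)

mean = (np.sqrt(abar_s) * x0
        + np.sqrt(1 - abar_s - sigma**2)
        * (x_k - np.sqrt(abar_k) * x0) / np.sqrt(1 - abar_k))
x_s = mean + sigma * rng.standard_normal(x_k.shape)           # x_s ~ p(x_s | x_k, x_0)

print(x_s.mean(), np.sqrt(abar_s) * x0)    # both approximately sqrt(abar_s) * x0
print(x_s.var(), 1 - abar_s)               # both approximately 1 - abar_s
```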
5. DDIM's Overall Step is Unchanged (Appendix B, Lemma 1)

Lemma 1. For $q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ with $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} \left(\sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I}\right)$, we have:

$$
q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})
$$

Proof:
We prove this by induction on $t$, from $T$ down to $1$.

By the definition of $q_{\sigma}(\bold{x}_T | \bold{x}_0)$, the case $t = T$ holds.

Assume the claim holds for some $t \leq T$, i.e., $q_{\sigma} (\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})$; we show it then holds for $t - 1$.
First, we have:

$$
q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0) = \int_{\bold{x}_t} q_{\sigma} (\bold{x}_{t} | \bold{x}_0) \, q_{\sigma} (\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \, d \bold{x}_t
$$

with

$$
q_{\sigma} (\bold{x}_{t} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I})
$$

$$
q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0) = \mathcal{N} \left(\sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\bold{x}_t - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}}, \ \sigma_t^2 \bold{I}\right)
$$

📖 Bishop (2006), Eq. (2.115):

If we have a Gaussian distribution for $\bold{x}$ and a conditional Gaussian distribution for $\bold{y}$ given $\bold{x}$ in the form

$$
p(\bold{x}) = \mathcal{N}(\bold{x} | \mu, \Lambda^{-1}), \qquad p(\bold{y} | \bold{x}) = \mathcal{N} (\bold{y} | \bold{Ax} + \bold{b}, \bold{L}^{-1})
$$

then the marginal distribution of $\bold{y}$ and the conditional distribution of $\bold{x}$ given $\bold{y}$ are given by:

$$
p(\bold{y}) = \mathcal{N} (\bold{y} | \bold{A \mu} + \bold{b}, \ \bold{L}^{-1} + \bold{A \Lambda^{-1} A^{T}} ), \qquad p(\bold{x} | \bold{y}) = \mathcal{N}(\bold{x} | \Sigma [\bold{A^T L (y - b) + \Lambda \mu}], \ \Sigma)
$$

where $\Sigma = (\Lambda + \bold{A^T L A})^{-1}$.
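A quick scalar sanity check of the marginal formula (my own sketch with made-up scalar values, where $\Lambda^{-1}$ and $\bold{L}^{-1}$ reduce to the variances `lam_inv` and `L_inv`):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, lam_inv = 2.0, 0.5         # p(x)   = N(mu, lam_inv)
A, b, L_inv = 3.0, -1.0, 0.2   # p(y|x) = N(A x + b, L_inv)

x = mu + np.sqrt(lam_inv) * rng.standard_normal(1_000_000)
y = A * x + b + np.sqrt(L_inv) * rng.standard_normal(x.shape)

print(y.mean(), A * mu + b)               # marginal mean:     A mu + b
print(y.var(), L_inv + A**2 * lam_inv)    # marginal variance: L^{-1} + A Lambda^{-1} A^T
```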
From Bishop (2006) Eq. (2.115), $q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0)$ is Gaussian, denoted $\mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$, where

$$
\begin{align*}
\mu_{t-1}
&= \sqrt{\alpha_{t-1}} \bold{x}_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\sqrt{\alpha_t} \bold{x}_0 - \sqrt{\alpha_t} \bold{x}_0}{\sqrt{1 - \alpha_t}} \\
&= \sqrt{\alpha_{t-1}} \bold{x}_0
\end{align*}
$$

and

$$
\Sigma_{t-1} = \sigma_t^2 \bold{I} + \frac{1 - \alpha_{t-1} - \sigma_t^2}{1 - \alpha_t} (1 - \alpha_t) \bold{I} = (1 - \alpha_{t-1}) \bold{I}
$$

Therefore, $q_{\sigma} (\bold{x}_{t-1} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{t-1}} \bold{x}_0, (1 - \alpha_{t-1}) \bold{I})$, which completes the induction step.
End of Proof.
6. Generative Process and Unified Variational Inference Objective

Generative Process

Intuitively, given a noisy observation $\bold{x}_t$, we first make a prediction of the corresponding $\bold{x}_0$, and then use it to obtain a sample $\bold{x}_{t-1}$ through the reverse conditional distribution $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ defined above.
For some $\bold{x}_0 \sim q(\bold{x}_0)$ and $\epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})$, $\bold{x}_t$ can be obtained from the overall step of the forward process:

$$
\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon_t, \qquad \text{where } \epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})
$$

The model $\epsilon_{\theta}^{(t)}(\bold{x}_t)$ then attempts to predict $\epsilon_t$ without knowledge of $\bold{x}_0$. By rewriting the overall step, we get the prediction of $\bold{x}_0$ given $\bold{x}_t$:

$$
f_{\theta}^{(t)} (\bold{x}_t) \coloneqq (\bold{x}_t - \sqrt{1 - \alpha_t} \cdot \epsilon_{\theta}^{(t)} (\bold{x}_t)) / \sqrt{\alpha_t}
$$

We can define a fixed prior $p_{\theta}(\bold{x}_T) = \mathcal{N}(\bold{0}, \bold{I})$ for the generative process and

$$
p_{\theta}^{(t)} (\bold{x}_{t-1} | \bold{x}_t) =
\begin{cases}
\mathcal{N}(f_{\theta}^{(1)} (\bold{x}_1), \sigma_1^2 \bold{I}) & \text{if } t = 1 \\
q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, f_{\theta}^{(t)} (\bold{x}_t)) & \text{otherwise}
\end{cases}
$$
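A minimal NumPy sketch of this generative conditional (my own illustration; `eps_model` is a hypothetical pretrained noise predictor, `alphas` a hypothetical schedule with `alphas[0] = 1`, and `sigmas` a hypothetical array of per-step values with $\sigma_t^2 \leq 1 - \alpha_{t-1}$):

```python
import numpy as np

def predict_x0(x_t, t, eps_model, alphas):
    """f_theta^{(t)}(x_t): invert the overall step using the predicted noise."""
    a_t = alphas[t]
    return (x_t - np.sqrt(1 - a_t) * eps_model(x_t, t)) / np.sqrt(a_t)

def p_theta_sample(x_t, t, eps_model, alphas, sigmas, rng):
    """Sample x_{t-1} ~ p_theta^{(t)}(x_{t-1} | x_t)."""
    x0_hat = predict_x0(x_t, t, eps_model, alphas)
    if t == 1:                                            # N(f_theta^{(1)}(x_1), sigma_1^2 I)
        return x0_hat + sigmas[1] * rng.standard_normal(x_t.shape)
    a_t, a_prev, s_t = alphas[t], alphas[t - 1], sigmas[t]
    mean = (np.sqrt(a_prev) * x0_hat                      # mean of q_sigma(x_{t-1} | x_t, x0_hat)
            + np.sqrt(1 - a_prev - s_t**2)
            * (x_t - np.sqrt(a_t) * x0_hat) / np.sqrt(1 - a_t))
    return mean + s_t * rng.standard_normal(x_t.shape)
```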
Variational Inference Objective

$\theta$ can be optimized with the following variational inference objective (which is a functional over $\epsilon_{\theta}$):

$$
\begin{align*}
J_{\sigma} (\epsilon_{\theta})
& \coloneqq \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} [\log q_{\sigma} (\bold{x}_{1:T} | \bold{x}_0) - \log p_{\theta}(\bold{x}_{0:T})] \\
& = \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} \left[ \log q_{\sigma}(\bold{x}_T | \bold{x}_0) + \sum_{t=2}^T \log q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) - \sum_{t=1}^T \log p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t) - \log p_{\theta}(\bold{x}_T) \right]
\end{align*}
$$

where $q_{\sigma} (\bold{x}_{1:T} | \bold{x}_0)$ is factorized as $q_{\sigma}(\bold{x}_{1:T} | \bold{x}_0) \coloneqq q_{\sigma}(\bold{x}_T | \bold{x}_0) \prod_{t=2}^T q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ and $p_{\theta}(\bold{x}_{0:T})$ is factorized as $p_{\theta} (\bold{x}_{0:T}) \coloneqq p_{\theta} (\bold{x}_T) \prod_{t=1}^T p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)$.
It can be proved that $J_{\sigma}$ is equivalent to $L_{\gamma}$ (the DDPM objective) for certain weights $\gamma$. Therefore, the model $\epsilon_{\theta}$ does not require retraining, and we can simply use a model trained with the DDPM objective.

Theorem 1. For all $\sigma > 0$, there exist $\gamma \in \mathbb{R}_{>0}^{T}$ and $C \in \mathbb{R}$, such that $J_{\sigma} = L_{\gamma} + C$.
Proof:

From the definition of $J_{\sigma}$:

$$
\begin{align*}
J_{\sigma} & \coloneqq \mathbb{E}_{\bold{x}_{0:T} \sim q(\bold{x}_{0:T})} \left[ \log q_{\sigma}(\bold{x}_T | \bold{x}_0) + \sum_{t=2}^T \log q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) - \sum_{t=1}^T \log p_{\theta}^{(t)} (\bold{x}_{t-1} | \bold{x}_t) \right] \\
& \equiv \mathbb{E}_{\bold{x}_{0:T} \sim q(\bold{x}_{0:T})} \left[ \sum_{t=2}^T D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1}|\bold{x}_t, \bold{x}_0)\,||\,p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)) - \log p_{\theta}^{(1)}(\bold{x}_0 | \bold{x}_1) \right]
\end{align*}
$$

where we use $\equiv$ to denote "equal up to a value that does not depend on $\epsilon_{\theta}$ (but may depend on $q_{\sigma}$)".
☝ How is the $\equiv$ derived?

Consider the term $\sum_{t=2}^T \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)}$:

$$
\begin{align*}
&\mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})}\left[ \sum_{t=2}^T \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \right] \\
=& \sum_{t=2}^T \mathbb{E}_{(\bold{x}_0, \bold{x}_{t-1}, \bold{x}_t) \sim q_{\sigma}(\bold{x}_0, \bold{x}_{t-1}, \bold{x}_t)} \left[\log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \right] \\
=& \sum_{t=2}^T \int q_{\sigma}(\bold{x}_0, \bold{x}_{t-1}, \bold{x}_t) \cdot \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \, d\bold{x}_0 \, d\bold{x}_{t-1} \, d\bold{x}_t \\
=& \sum_{t=2}^T \int q_{\sigma}(\bold{x}_t, \bold{x}_0) \cdot q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \cdot \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \, d\bold{x}_0 \, d\bold{x}_{t-1} \, d\bold{x}_t \\
=& \sum_{t=2}^T \mathbb{E}_{(\bold{x}_0, \bold{x}_t) \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \int q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \cdot \log \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)}{p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)} \, d \bold{x}_{t-1} \right] \\
=& \sum_{t=2}^T \mathbb{E}_{(\bold{x}_0, \bold{x}_t) \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1}|\bold{x}_t) )\right] \\
=& \ \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma}(\bold{x}_{0:T})} \left[ \sum_{t=2}^T D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1}|\bold{x}_t) )\right]
\end{align*}
$$

For $t > 1$, by the definition of $p_{\theta}^{(t)}(\bold{x}_{t-1} | \bold{x}_t)$:
$$
\begin{align*}
&\mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_{t-1}|\bold{x}_t) )\right] \\
=& \ \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, q_{\sigma}(\bold{x}_{t-1}|\bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t)) )\right] \\
\equiv & \ \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \frac{||\bold{x}_0 - f_{\theta}^{(t)}(\bold{x}_t)||_2^2}{2 \sigma_t^2} \right] \\
=& \ \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}), \bold{x}_t=\sqrt{\alpha_t}\bold{x}_0 + \sqrt{1 - \alpha_t}\epsilon} \left[ \frac{\left\|\frac{\bold{x}_t - \sqrt{1 - \alpha_t}\epsilon}{\sqrt{\alpha_t}} - \frac{\bold{x}_t - \sqrt{1 - \alpha_t}\epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{\alpha_t}}\right\|_2^2}{2 \sigma_t^2} \right] \\
=& \ \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}), \bold{x}_t=\sqrt{\alpha_t}\bold{x}_0 + \sqrt{1 - \alpha_t}\epsilon}\left[ \frac{||\epsilon - \epsilon_{\theta}^{(t)}(\bold{x}_t)||_2^2}{2d\sigma_t^2 \alpha_t} \right]
\end{align*}
$$

where $d$ is the dimension of $\bold{x}_0$.
☝ Line 3 "$\equiv$":

Recall that the KL divergence between two Gaussian distributions is:

$$
D_{\text{KL}} (\mathcal{N}(\bold{x}; \mu_{x}, \Sigma_x)\,||\,\mathcal{N}(\bold{y}; \mu_{y}, \Sigma_y)) = \frac{1}{2} \left[ \log \frac{|\Sigma_y|}{|\Sigma_x|} - d + \text{tr}(\Sigma_y^{-1} \Sigma_x) + (\mu_y - \mu_x)^T \Sigma_y^{-1} (\mu_y - \mu_x) \right]
$$

When $\Sigma_x = \Sigma_y = \sigma_t^2 \bold{I}$, this simplifies to:

$$
D_{\text{KL}} (\mathcal{N}(\bold{x}; \mu_{x}, \Sigma_x)\,||\,\mathcal{N}(\bold{y}; \mu_{y}, \Sigma_y)) = \frac{1}{2 \sigma_t^2} ||\mu_x - \mu_y||_2^2
$$

The mean of $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0)$ can be written as $n \cdot \bold{x}_t + m \cdot \bold{x}_0$ where $n$ and $m$ are known constants, and the mean of $q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t))$ is $n \cdot \bold{x}_t + m \cdot f_{\theta}^{(t)}(\bold{x}_t)$ with the same constants. Then,

$$
\begin{align*}
& \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ D_{\text{KL}}(q_{\sigma}(\bold{x}_{t-1} | \bold{x}_t, \bold{x}_0) \,||\, q_{\sigma}(\bold{x}_{t-1}|\bold{x}_t, f_{\theta}^{(t)}(\bold{x}_t)) )\right] \\
=& \ \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \frac{1}{2 \sigma_t^2} ||\mu_x - \mu_y||_2^2 \right] \\
=& \ \mathbb{E}_{\bold{x}_0, \bold{x}_t \sim q_{\sigma}(\bold{x}_0, \bold{x}_t)} \left[ \frac{m^2}{2 \sigma_t^2} ||\bold{x}_0 - f_{\theta}^{(t)}(\bold{x}_t)||_2^2 \right]
\end{align*}
$$

Question: why, in Line 5, does a $d$ suddenly appear in the denominator while the $\sqrt{1 - \alpha_t}$ factor disappears from the numerator? (The coefficients here follow the constants written in the paper's appendix; the direct algebra would give a different positive per-$t$ constant. In any case, the exact constant does not matter for Theorem 1, since any positive per-$t$ weight can be absorbed into $\gamma_t$.)
For $t = 1$:

$$
\begin{align*}
&\mathbb{E}_{\bold{x}_0, \bold{x}_1 \sim q_{\sigma}(\bold{x}_0, \bold{x}_1)} \left[ -\log p_{\theta}^{(1)} (\bold{x}_0 | \bold{x}_1) \right] \\
=& \ \mathbb{E}_{\bold{x}_0, \bold{x}_1 \sim q_{\sigma}(\bold{x}_0, \bold{x}_1)} \left[ \text{constant} + \frac{||\bold{x}_0 - f_{\theta}^{(1)}(\bold{x}_1)||_2^2}{2\sigma_1^2} \right] \\
\equiv & \ \mathbb{E}_{\bold{x}_0, \bold{x}_1 \sim q_{\sigma}(\bold{x}_0, \bold{x}_1)} \left[ \frac{||\bold{x}_0 - f_{\theta}^{(1)}(\bold{x}_1)||_2^2}{2\sigma_1^2} \right] \\
=& \ \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}), \bold{x}_1=\sqrt{\alpha_1} \bold{x}_0 + \sqrt{1 - \alpha_1}\epsilon} \left[ \frac{\left\|\frac{\bold{x}_1 - \sqrt{1 - \alpha_1}\epsilon}{\sqrt{\alpha_1}} - \frac{\bold{x}_1 - \sqrt{1 - \alpha_1}\epsilon_{\theta}^{(1)}(\bold{x}_1)}{\sqrt{\alpha_1}}\right\|_2^2}{2\sigma_1^2} \right] \\
=& \ \mathbb{E}_{\bold{x}_0 \sim q(\bold{x}_0), \epsilon \sim \mathcal{N}(\bold{0}, \bold{I}), \bold{x}_1=\sqrt{\alpha_1} \bold{x}_0 + \sqrt{1 - \alpha_1}\epsilon} \left[ \frac{||\epsilon - \epsilon_{\theta}^{(1)}(\bold{x}_1)||_2^2}{2d\sigma_1^2 \alpha_1} \right]
\end{align*}
$$

☝ In the last line the $d$ appears and the $\sqrt{1 - \alpha_1}$ disappears again; the same remark as above applies.
Therefore, when $\gamma_t = 1 / (2 d \sigma_t^2 \alpha_t)$ for all $t \in \{1, \dots, T\}$, we have

$$
J_{\sigma}(\epsilon_{\theta}) \equiv \sum_{t=1}^T \frac{1}{2d\sigma_t^2 \alpha_t} \mathbb{E}\left[ ||\epsilon_{\theta}^{(t)}(\bold{x}_t) - \epsilon_t||_2^2 \right] = L_{\gamma}(\epsilon_{\theta})
$$

for all $\epsilon_{\theta}$.
End of Proof.
7. Sampling from Generalized Generative Processes

IDEA: Use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples by changing $\sigma$.

Denoising Diffusion Implicit Models

One can generate a sample $\bold{x}_{t-1}$ from a sample $\bold{x}_t$ via:

$$
\bold{x}_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left( \frac{\bold{x}_t - \sqrt{1 - \alpha_t} \epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{\alpha_t}} \right)}_{\text{"predicted } \bold{x}_0 \text{"}} + \underbrace{\sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_{\theta}^{(t)} (\bold{x}_t)}_{\text{"direction pointing to } \bold{x}_t \text{"}} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}}
$$

where $\epsilon_t \sim \mathcal{N}(\bold{0}, \bold{I})$ is standard Gaussian noise independent of $\bold{x}_t$, and we define $\alpha_0 \coloneqq 1$.

Different $\sigma$ values result in different generative processes, all using the same model $\epsilon_{\theta}$, so re-training the model is unnecessary. When $\sigma_t = \sqrt{(1 - \alpha_{t-1}) / (1 - \alpha_t)} \sqrt{1 - \alpha_t / \alpha_{t-1}}$ for all $t$, the forward process becomes Markovian and the generative process becomes a DDPM.
Another special case is $\sigma_t = 0$ for all $t$. The forward process $q_{\sigma}(\bold{x}_t | \bold{x}_{t-1}, \bold{x}_0) = \frac{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_{t}, \bold{x}_0)\, q_{\sigma}(\bold{x}_t | \bold{x}_0)}{q_{\sigma}(\bold{x}_{t-1} | \bold{x}_0)}$ is then deterministic given $\bold{x}_{t-1}$ and $\bold{x}_0$, except for $t = 1$ (because $q_{\sigma}(\bold{x}_1 | \bold{x}_0)$ cannot be derived from the Bayes rule above). The generative process also becomes deterministic, because the coefficient of the random noise $\epsilon_t$ becomes zero.

This model with $\sigma_t = 0$ is named the denoising diffusion implicit model (DDIM).

More generally, we can define $\sigma_t = \eta \cdot \sigma_{\text{DDPM}}$. When $\eta = 1$, it's DDPM; when $\eta = 0$, it's DDIM.
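Putting the pieces together, here is a sketch of one generalized sampling step with the $\eta$ parameterization (my own NumPy illustration; `eps_model` and the `alphas` array, with `alphas[0] = 1`, are hypothetical placeholders):

```python
import numpy as np

def generalized_step(x_t, t, t_prev, eps_model, alphas, eta, rng):
    """One step x_t -> x_{t_prev}; eta = 0 gives DDIM, eta = 1 recovers DDPM."""
    a_t, a_prev = alphas[t], alphas[t_prev]
    eps = eps_model(x_t, t)
    x0_pred = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)          # "predicted x_0"
    sigma_t = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    dir_xt = np.sqrt(1 - a_prev - sigma_t**2) * eps                  # "direction pointing to x_t"
    noise = sigma_t * rng.standard_normal(x_t.shape)                 # random noise term
    return np.sqrt(a_prev) * x0_pred + dir_xt + noise

# usage: start from x_T ~ N(0, I) and walk t = T, ..., 1 (here with a dummy model)
rng = np.random.default_rng(0)
T = 1000
alphas = np.concatenate([[1.0], np.linspace(0.9999, 1e-4, T)])       # alpha_0 := 1
x = rng.standard_normal(32)
for t in range(T, 0, -1):
    x = generalized_step(x, t, t - 1, lambda x_, t_: np.zeros_like(x_), alphas, 0.0, rng)
```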
Accelerated Generation Processes

Since the denoising objective $L_1$ does not depend on the specific forward procedure as long as $q_{\sigma}(\bold{x}_t | \bold{x}_0)$ is fixed, we may also consider forward processes with length smaller than $T$, which accelerates the corresponding generative processes without having to train a different model.

Consider a subset $\{ \bold{x}_{\tau_1} , \dots, \bold{x}_{\tau_S} \}$ of the latent variables, indexed by a sub-sequence $\tau$ of $\{ 1, \dots, T \}$. A new forward process is defined over $\bold{x}_{\tau_1}, \dots, \bold{x}_{\tau_S}$ such that $q(\bold{x}_{\tau_i} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{\tau_i}}\bold{x}_0, (1 - \alpha_{\tau_i}) \bold{I})$. The generative process now samples latent variables according to $\text{reversed}(\tau)$, which is termed the (sampling) trajectory; a code sketch is given at the end of this subsection.
Now we show that the training objective is still unchanged, so we don't need to retrain the model.

Consider the inference process to be factored as:

$$
q_{\sigma, \tau}(\bold{x}_{1:T} | \bold{x}_0) = q_{\sigma, \tau}(\bold{x}_{\tau_S} | \bold{x}_0) \prod_{i=1}^{S} q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0) \prod_{t \in \bar{\tau}} q_{\sigma, \tau}(\bold{x}_t | \bold{x}_0)
$$

where $\tau$ is a sub-sequence of $[1, \dots, T]$ of length $S$ with $\tau_S = T$, and $\bar{\tau} \coloneqq \{ 1, \dots, T \} \backslash \tau$ is its complement. We define:

$$
q_{\sigma, \tau}(\bold{x}_t | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_t} \bold{x}_0, (1 - \alpha_t) \bold{I}) \qquad \forall t \in \bar{\tau} \cup \{ T \}
$$

$$
q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0) = \mathcal{N} \left( \sqrt{\alpha_{\tau_{i-1}}}\bold{x}_0 + \sqrt{1 - \alpha_{\tau_{i-1}} - \sigma_{\tau_i}^2} \cdot \frac{\bold{x}_{\tau_i}-\sqrt{\alpha_{\tau_i}}\bold{x}_0}{\sqrt{1 - \alpha_{\tau_i}}}, \ \sigma_{\tau_i}^2 \bold{I} \right) \quad \forall i \in [S]
$$

where the coefficients are chosen such that:

$$
q_{\sigma, \tau}(\bold{x}_{\tau_i} | \bold{x}_0) = \mathcal{N}(\sqrt{\alpha_{\tau_i}} \bold{x}_0, (1 - \alpha_{\tau_i})\bold{I}) \quad \forall i \in [S]
$$

i.e., the "marginals" match.
The corresponding "generative process" is defined as:

$$
p_{\theta}(\bold{x}_{0:T}) \coloneqq \underbrace{p_{\theta}(\bold{x}_T)\prod_{i=1}^S p_{\theta}^{(\tau_i)}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i})}_{\text{used to produce samples}} \times \underbrace{\prod_{t \in \bar{\tau}}p_{\theta}^{(t)} (\bold{x}_0 | \bold{x}_t)}_{\text{only in the variational objective}}
$$

where only part of the models are used to produce samples. The conditionals are:

$$
p_{\theta}^{(\tau_i)}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}) = q_{\sigma, \tau} (\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, f_{\theta}^{(\tau_i)}(\bold{x}_{\tau_i})) \quad \text{if } i \in [S],\ i > 1
$$

$$
p_{\theta}^{(t)}(\bold{x}_0 | \bold{x}_t) = \mathcal{N}(f_{\theta}^{(t)}(\bold{x}_t), \sigma_t^2 \bold{I}) \quad \text{otherwise,}
$$

where we leverage $q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0)$ as part of the inference process. The resulting variational objective becomes (defining $\bold{x}_{\tau_{S + 1}} = \varnothing$ for conciseness):
$$
\begin{align*}
J(\epsilon_{\theta})
&= \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma, \tau}(\bold{x}_{0:T})}[\log q_{\sigma, \tau}(\bold{x}_{1:T} | \bold{x}_0) - \log p_{\theta}(\bold{x}_{0:T})] \\
&= \mathbb{E}_{\bold{x}_{0:T} \sim q_{\sigma, \tau}(\bold{x}_{0:T})} \left[ \sum_{t \in \bar{\tau}}D_{\text{KL}}(q_{\sigma, \tau} (\bold{x}_t | \bold{x}_0) \,||\, p_{\theta}^{(t)}(\bold{x}_0 | \bold{x}_t)) + \sum_{i=1}^S D_{\text{KL}}(q_{\sigma, \tau}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i}, \bold{x}_0) \,||\, p_{\theta}^{(\tau_i)}(\bold{x}_{\tau_{i-1}} | \bold{x}_{\tau_i})) \right]
\end{align*}
$$

where each KL divergence is between two Gaussians with variance independent of $\theta$. An argument similar to the proof of Theorem 1 shows that the variational objective $J$ can also be converted to an objective of the form $L_{\gamma}$. (Not fully understood yet 😢)
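Here is the accelerated-sampling sketch referenced above: it picks a length-$S$ sub-sequence $\tau$ of $\{1, \dots, T\}$ with $\tau_S = T$ and walks $\text{reversed}(\tau)$ using the same update rule (my own NumPy illustration; `eps_model` and `alphas` are hypothetical placeholders as before):

```python
import numpy as np

def ddim_sample_accelerated(eps_model, alphas, S, shape, eta=0.0, seed=0):
    """Sample with an S-step trajectory tau instead of all T steps.

    alphas has length T + 1 with alphas[0] = 1, so the final step
    (tau_1 -> 0) lands on the predicted x_0.
    """
    rng = np.random.default_rng(seed)
    T = len(alphas) - 1
    tau = np.linspace(1, T, S, dtype=int)                   # tau_1 < ... < tau_S = T
    rev = list(tau[::-1]) + [0]                              # reversed(tau), then 0

    x = rng.standard_normal(shape)                           # x_T ~ N(0, I)
    for t, t_prev in zip(rev[:-1], rev[1:]):                 # consecutive trajectory steps
        a_t, a_prev = alphas[t], alphas[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
        x = (np.sqrt(a_prev) * x0_pred
             + np.sqrt(1 - a_prev - sigma**2) * eps
             + sigma * rng.standard_normal(shape))
    return x

# e.g. 50 sampling steps instead of 1000, with a dummy noise model
alphas = np.concatenate([[1.0], np.linspace(0.9999, 1e-4, 1000)])
sample = ddim_sample_accelerated(lambda x, t: np.zeros_like(x), alphas, S=50, shape=(32,))
```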
8. Relevance to Neural ODEs

The sampling equation is given by:

$$
\bold{x}_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left( \frac{\bold{x}_t - \sqrt{1 - \alpha_t} \epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sqrt{\alpha_t}} \right)}_{\text{"predicted } \bold{x}_0 \text{"}} + \underbrace{\sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \epsilon_{\theta}^{(t)} (\bold{x}_t)}_{\text{"direction pointing to } \bold{x}_t \text{"}} + \underbrace{\sigma_t \epsilon_t}_{\text{random noise}}
$$

The corresponding ODE is:

$$
d \bar{\bold{x}}(t) = \epsilon_{\theta}^{(t)} \left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right) d \sigma(t)
$$

where $\sigma = \sqrt{1 - \alpha} / \sqrt{\alpha}$ and $\bar{\bold{x}} = \bold{x} / \sqrt{\alpha}$.
We want to derive the ODE form of this sampling equation.

We also want to show that the ODE form of DDIM is equivalent to the probability flow ODE of the VE-SDE.

Proposition 1. The ODE form of DDIM with the optimal model $\epsilon_{\theta}^{(t)}$ has an equivalent probability flow ODE corresponding to the "Variance-Exploding" SDE in Song et al. (2020).
Proof.
We consider $t$ as a continuous, independent "time" variable, and $\bold{x}$ and $\alpha$ as functions of $t$. Let's consider a reparameterization between DDIM and the VE-SDE by introducing the variables $\bar{\bold{x}}$ and $\sigma$:

$$
\bar{\bold{x}} (t) = \bar{\bold{x}}(0) + \sigma(t)\epsilon, \qquad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})
$$

for $t \in [0, \infty)$ and an increasing continuous function $\sigma: \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}_{\geq 0}$ with $\sigma(0) = 0$.

We can then define $\alpha(t)$ and $\bold{x}(t)$ corresponding to the DDIM case as:

$$
\bar{\bold{x}} (t) = \frac{\bold{x}(t)}{\sqrt{\alpha(t)}}, \qquad \sigma(t) = \sqrt{\frac{1 - \alpha(t)}{\alpha(t)}}
$$

This also means that:

$$
\bold{x}(t) = \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}}, \qquad \alpha(t) = \frac{1}{1 + \sigma^2(t)}
$$

which establishes a bijection between $(\bold{x}, \alpha)$ and $(\bar{\bold{x}}, \sigma)$.
Given that $\alpha(0) = 1$ and the overall step $\bold{x}_t = \sqrt{\alpha_t} \bold{x}_0 + \sqrt{1 - \alpha_t} \epsilon$, we have:

$$
\frac{\bold{x}(t)}{\sqrt{\alpha(t)}} = \frac{\bold{x}(0)}{\sqrt{\alpha(0)}} + \sqrt{\frac{1 - \alpha(t)}{\alpha(t)}} \epsilon, \quad \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})
$$

which can be reparameterized into a form consistent with the VE-SDE:

$$
\bar{\bold{x}}(t) = \bar{\bold{x}}(0) + \sigma(t) \epsilon
$$

Now we derive the ODE forms for both DDIM and the VE-SDE and show that they are equivalent.
ODE form for DDIM
The sampling equation can be rewritten as ($\sigma_t = 0$ for DDIM):
$$\frac{\bold{x}_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} = \frac{\bold{x}_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}}} - \sqrt{\frac{1 - \alpha_t}{\alpha_t}} \right) \epsilon_{\theta}^{(t)}(\bold{x}_t)$$
which is equivalent to:
$$\bar{\bold{x}}(t - \Delta t) = \bar{\bold{x}}(t) + (\sigma(t - \Delta t) - \sigma(t)) \cdot \epsilon_{\theta}^{(t)}(\bold{x}_t)$$
Dividing both sides by $-\Delta t$ and letting $\Delta t \rightarrow 0$, we have:
$$\frac{\text{d} \bar{\bold{x}}(t)}{\text{d} t} = \frac{\text{d} \sigma(t)}{\text{d} t} \epsilon_{\theta}^{(t)} \left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right)$$
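As a sketch (again assuming a placeholder noise-prediction model `eps_model(x, t)`), one Euler step of this ODE in the $(\bar{\bold{x}}, \sigma)$ coordinates looks like:

```python
# A single Euler step of the DDIM ODE above, stepping sigma(t) -> sigma(t - dt).
def ddim_ode_euler_step(x_bar, sigma_t, sigma_prev, t, eps_model):
    # map back to x(t) = x_bar(t) / sqrt(sigma(t)^2 + 1) before calling the model
    x_t = x_bar / (sigma_t ** 2 + 1.0) ** 0.5
    eps = eps_model(x_t, t)
    # x_bar(t - dt) = x_bar(t) + (sigma(t - dt) - sigma(t)) * eps
    return x_bar + (sigma_prev - sigma_t) * eps
```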
ODE form for VE-SDE

Define $p_t(\bar{\bold{x}})$ as the data distribution perturbed with $\sigma^2(t)$-variance Gaussian noise. The probability flow ODE for the VE-SDE is defined as:
$$\text{d} \bar{\bold{x}} = -\frac{1}{2} g(t)^2 \nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}}) \, \text{d} t$$
where $g(t) = \sqrt{\frac{\text{d} \sigma^2(t)}{\text{d} t}}$ is the diffusion coefficient and $\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}})$ is the score of $p_t$.
From denoising score matching, the $\sigma(t)$-perturbed score function $\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}})$ is a minimizer of:
$$\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}}) = \argmin_{g_t} \mathbb{E}_{\bold{x}(0) \sim q(\bold{x}), \epsilon \sim \mathcal{N}(\bold{0}, \bold{I})} \left[ \| g_t(\bar{\bold{x}}) + \epsilon / \sigma(t) \|_2^2 \right]$$
Given the assumption that $\epsilon_{\theta}^{(t)}$ is the optimal model, we have:
$$\nabla_{\bar{\bold{x}}} \log p_t(\bar{\bold{x}}) = -\frac{\epsilon_{\theta}^{(t)}(\bold{x}_t)}{\sigma(t)} = -\frac{\epsilon_{\theta}^{(t)}\left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right)}{\sigma(t)}$$
Plugging this into the probability flow ODE for the VE-SDE:
$$
\begin{align*}
\text{d} \bar{\bold{x}}(t) &= \frac{1}{2} \frac{\text{d} \sigma^2(t)}{\text{d} t} \frac{\epsilon_{\theta}^{(t)}\left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right)}{\sigma(t)} \, \text{d} t \\
\Leftrightarrow \quad \frac{\text{d} \bar{\bold{x}}(t)}{\text{d} t} &= \frac{\text{d} \sigma(t)}{\text{d} t} \epsilon_{\theta}^{(t)}\left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right)
\end{align*}
$$
This is exactly the same as the ODE form of DDIM. In both cases the initial condition is $\bar{\bold{x}}(T) \sim \mathcal{N}(\bold{0}, \sigma^2(T) \bold{I})$, so the resulting ODEs are identical.
By discretizing the above equation, we have:
$$
\begin{align*}
\bar{\bold{x}}(t - \Delta t) - \bar{\bold{x}}(t) &= \frac{1}{2} \left( \sigma^2(t - \Delta t) - \sigma^2(t) \right) \cdot \frac{1}{\sigma(t)} \cdot \epsilon_{\theta}^{(t)}\left( \frac{\bar{\bold{x}}(t)}{\sqrt{\sigma^2(t) + 1}} \right) \\
\frac{\bold{x}_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} - \frac{\bold{x}_t}{\sqrt{\alpha_t}} &= \frac{1}{2} \left( \frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}} - \frac{1 - \alpha_t}{\alpha_t} \right) \cdot \sqrt{\frac{\alpha_t}{1 - \alpha_t}} \, \epsilon_{\theta}^{(t)}(\bold{x}_t) \\
\Rightarrow \qquad \frac{\bold{x}_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} &= \frac{\bold{x}_t}{\sqrt{\alpha_t}} + \frac{1}{2} \left( \frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}} - \frac{1 - \alpha_t}{\alpha_t} \right) \cdot \sqrt{\frac{\alpha_t}{1 - \alpha_t}} \, \epsilon_{\theta}^{(t)}(\bold{x}_t)
\end{align*}
$$
This Euler step places the coefficient $(\sigma^2(t - \Delta t) - \sigma^2(t)) / (2 \sigma(t))$ in front of $\epsilon_{\theta}^{(t)}$, whereas the DDIM update uses $\sigma(t - \Delta t) - \sigma(t)$; the two agree to first order, so the discretizations coincide in the limit $\Delta t \rightarrow 0$.
9. Parameter Setting of DDIM

DDIM is better than DDPM in the following ways:
- Generation speeds up by 10x to 100x.
- Once the initial $\bold{x}_T$ is fixed, DDIM retains high-level image features across different numbers of sampling steps.
- No changes are needed to the training procedure. The only change is how samples are produced from the model, controlled by $\tau$ (how fast the samples are obtained) and $\sigma$ (which interpolates between the deterministic DDIM and the stochastic DDPM).
We consider different sub-sequences $\tau$ of $[1, \dots, T]$ and different hyperparameters $\sigma$ indexed by the elements of $\tau$. To simplify comparisons, we consider $\sigma$ of the form:
$$\sigma_{\tau_i}(\eta) = \eta \sqrt{(1 - \alpha_{\tau_{i-1}}) / (1 - \alpha_{\tau_i})} \sqrt{1 - \alpha_{\tau_i} / \alpha_{\tau_{i-1}}}$$
where $\eta \in \mathbb{R}_{\geq 0}$ is a hyperparameter that can be directly controlled. This recovers the original DDPM generative process when $\eta = 1$ and DDIM when $\eta = 0$.
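A small sketch of this schedule, under the assumptions (mine, not the paper's code) that `alphas[t - 1]` stores this paper's cumulative $\alpha_t$, `tau` is a 1-based increasing sub-sequence of $[1, \dots, T]$, and $\alpha_{\tau_0} = 1$ (no noise) by convention:

```python
import numpy as np

def ddim_sigmas(alphas, tau, eta):
    a = alphas[np.asarray(tau) - 1]           # alpha_{tau_i}
    a_prev = np.concatenate(([1.0], a[:-1]))  # alpha_{tau_{i-1}}, with alpha_{tau_0} = 1
    return eta * np.sqrt((1.0 - a_prev) / (1.0 - a)) * np.sqrt(1.0 - a / a_prev)
```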
The sub-sequence $\tau$ is selected, given $\text{dim}(\tau) < T$, in one of two ways (a small sketch of both choices follows the list):
- **Linear**: $\tau_i = \lfloor ci \rfloor$
- **Quadratic**: $\tau_i = \lfloor ci^2 \rfloor$

The constant $c$ is selected so that the last element of $\tau$ is close to $T$.
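A minimal sketch of both choices; the choice of $c$, the array layout, and the clipping convention are my assumptions rather than the paper's code:

```python
# Build a sub-sequence tau of [1, ..., T] with S = dim(tau) sampling steps.
import numpy as np

def make_tau(T, S, kind="linear"):
    i = np.arange(1, S + 1)
    if kind == "linear":
        c = T / S
        tau = np.floor(c * i)
    else:  # "quadratic"
        c = T / S ** 2
        tau = np.floor(c * i ** 2)
    # keep indices in [1, T]; c is chosen so the last element lands near T
    return np.clip(tau.astype(int), 1, T)

tau = make_tau(T=1000, S=50, kind="quadratic")  # real code may also drop duplicates
```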